Abstract. In this position paper, I first describe a new perspective on
machine learning (ML) organized around four basic problems (or levels),
namely, “What to learn?”, “How to learn?”, “What to evaluate?”, and “What to
adjust?”. The paper places the most emphasis on the first level, “What to
learn?”, or “Learning Target Selection”. Toward this primary problem within
the four levels, I briefly review the existing studies on the connection
between information theoretical learning (ITL [1]) and machine learning.
A theorem is given on the relation between empirically-defined similarity
measures and information measures. Finally, a conjecture is proposed in
pursuit of a unified mathematical interpretation of learning target
selection.
“From the Tao comes one, from one comes two, from two comes three,
and from three comes all things.” [2]

- by Lao Tzu (ca. 600-500 BCE)

“Nature is the realization of the simplest conceivable mathematical ideas.” [3]

- by Albert Einstein (1879-1955)


1 Introduction 

Machine learning is the study and construction of systems that can learn from
data. The systems are called learning machines. As Big Data becomes
increasingly prevalent, more learning machines are developed and applied in
different domains. However, the ultimate goal of machine learning study is
insight, not the machine itself. By the term insight I mean learning
mechanisms described by mathematical principles. In a loose sense, learning
mechanisms can be regarded as a natural entity. Just as the “Tao (道)” of
Lao Tzu (老子) reflects the most fundamental aspect of the universe, Einstein
suggests that we should pursue the simplest mathematical interpretations of
nature. Although learning mechanisms are related to psychology, cognitive
science, and brain science, this paper emphasizes the exploration of
mathematical principles for interpreting learning mechanisms. Up to now, we
human beings are still far from a deep understanding of our own learning
mechanisms in terms of mathematical principles. It is the author’s belief
that a “mathematical-principle-based machine” might be more important and
critical than a “brain-inspired machine” in the study of machine learning.

The purpose of this position paper is to put forward a new perspective
and a novel conjecture within the study of machine learning. In what follows
I present four basic problems (or levels) in machine learning. The study
of information theoretical learning is briefly reviewed. A theorem on the
relation between empirically-defined similarity measures and information
measures is given. Based on the existing investigations, a conjecture is
proposed in this paper.


2 Four basic problems (or levels) in machine learning 

For information processing by a machine, in the 1980s, Marr [4] proposed a
novel methodology with three distinct yet complementary levels, namely,
“Computational theory”, “Representation and algorithm”, and “Hardware
implementation”. Although the three levels are only loosely “coupled”, the
distinction is necessary to isolate and solve problems properly and
efficiently. In 2007, Poggio [5] described another set of three levels on
learning, namely, “Learning theory and algorithms”, “Engineering
applications”, and “Neuroscience: models and experiments”. Apart from
offering a new perspective, one important contribution of this methodology
is the addition of a closed loop between the levels. These studies are
enlightening because they show that complex objects or systems should be
addressed by decomposition into different, yet basic, problems.
Philosophically, the methodology is considered to be reductionism.

In this paper, I propose a novel perspective on machine learning by four levels 
shown in Fig. 1. The levels correspond to four basic problems. The definition of 
each level is given below. 



Fig. 1. Four basic problems (or levels) in machine learning. 













Definition 1: “What to learn?” is a study on identifying learning target(s)
for the given problem(s), which will generally involve two distinct sets of
representations (Fig. 2) defined below.

Definition 1a: “Linguistic representation” reflects a high-level description
in a natural language of the expected learning information. This study is
more related to linguistics, psychology, and cognitive science.

Definition 1b: “Computational representation” defines the expected learning
information based on mathematical notations. It is a relatively low-level
representation which generally includes objective functions, constraints,
and optimization formulations.

Definition 2: “How to learn?” is a study on learning process design and
implementations. Probability, statistics, utility, optimization, and
computational theories will be the central subjects. The main concerns are
generalization performance, robustness, model complexity, computational
complexity/cost, etc. The study may include physically realized system(s).

Definition 3: “What to evaluate?” is a study on “evaluation measure
selection”, where an evaluation measure is a mathematical function. This
function can be the same as, or different from, the objective function
defined in the first level.

Definition 4: “What to adjust?” is a study on the dynamic behaviors of a
machine that result from adjusting its component(s). This level will endow a
machine with the functionality of “evolution of intelligence”.



Fig. 2. Design flow according to the basic problems in machine learning. 


The first level is also called “learning target selection”. The four levels
above are neither mutually exclusive nor collectively exhaustive for every
problem in machine learning. We call them basic so that additional problems
can be merged into one of the levels. Figs. 1 and 2 illustrate the relations
between the levels in different contexts. The problems within the four
levels are all interrelated, particularly “What to learn?” and “What to
evaluate?” (Fig. 2).








“How to learn?” may influence “What to learn?”, for example through the
convexity of the objective function or the scalability of learning
algorithms [6] from a computational cost consideration. Structurally, the
“What to adjust?” level provides the multiple closed loops that describe the
interrelations (Fig. 1). Artificial intelligence will play a critical role
via this level. In the “knowledge driven and data driven” model [7], the
benefits of utilizing this level are shown by two examples: a removable
singularity hypothesis for the “Sinc” function and prior updating for the
Mackey-Glass dataset. Philosophically, the “What to adjust?” level remedies
the intrinsic problems in the methodology of reductionism and offers the
functionality needed for holism. However, this level receives even less
attention, even though the learning process holds a self-organization
property.

I expect that the four levels offer a novel perspective on the basic
problems in machine learning. Take the example shown in Fig. 3 (after Duda
et al. [8], Fig. 5-17). Even for a linearly separable dataset, the learning
function using least mean square (LMS) does not guarantee a “minimum-error”
classification. This example demonstrates two points. First, the
computational representation of LMS is not compatible with the linguistic
representation of “minimum-error” classification. Second, whenever a
learning target is wrong in the computational representation, one is unable
to reach the goal through Levels 2 and 3. Another example, in Fig. 4, shows
why we need two sub-levels in learning target selection. For the given
character (here, Albert Einstein), one does need a linguistic representation
to describe the “(un)likeness” [9] between the original image and the
caricature image. Only when a linguistic representation is well defined can
a computational measure of similarity possibly be proper in caricature
learning. The qualifier “possibly proper” is due to the difficulty captured
in the following definition.



Fig. 3. Learning target selection for a linearly separable dataset
(after [8], Fig. 5-17). Black circle = Class 1, ruby square = Class 2.
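
The point made with Fig. 3 can be reproduced numerically. The sketch below is only an illustration under stated assumptions: the one-dimensional dataset is hypothetical (it is not the dataset of Fig. 3), and an ordinary least-squares fit to ±1 targets stands in for the LMS learning function. The data are separable by a simple threshold, yet the least-squares solution misclassifies one sample.

```python
import numpy as np

# Hypothetical 1-D dataset (not the one in Fig. 3). The two classes are
# separated by any threshold between x = 0 and x = 1.
x = np.array([-1.0, 0.0, 1.0, 20.0, 20.0, 20.0])
t = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, 1.0])

# Least-squares fit of w*x + b to the +/-1 targets (squared-error target).
X = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(X, t, rcond=None)

# Number of training errors of the least-squares decision rule sign(w*x + b).
ls_errors = int(np.sum(np.sign(w * x + b) != t))

# Number of training errors of a hand-picked threshold rule sign(x - 0.5).
threshold_errors = int(np.sum(np.sign(x - 0.5) != t))

print("least-squares errors:", ls_errors)
print("threshold errors:", threshold_errors)
```

The three samples far from the boundary pull the fitted line so that the positive sample at x = 1 falls on the wrong side, which is the incompatibility between a squared-error target and a “minimum-error” target described above.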


Definition 5: “Semantic gap” is the difference between the two sets of
representations. The gap can be bridged in two ways, namely, a direct way
that describes a connection from the linguistic representation to the
computational representation, and an inverse way that describes the
connection in the opposite direction.

In this paper, I extend the definition of the gap in [10] by distinguishing
the two ways. The gap reflects one of the critical difficulties in machine
learning. For the direct-way study, the difficulty stems mostly from the
ambiguity and subjectivity of the linguistic representation (say, of a
mental entity), which will




a) Original image b) Caricature image 


Fig. 4. Example of “What to learn?” and the need to define a linguistic
representation of similarity for the given character. a) Original image
(http://en.wikipedia.org/wiki/Albert_Einstein). b) Caricature image drawn
by A. Hirschfeld (http://www.georgejgoodstadt.com/goodstadt/hirschfeld.dca).


lead to an ill-defined problem. While sharing the same problem, an
inverse-way study introduces an extra challenge, the ill-posed problem, in
which there is no unique solution (say, from a 2D image to 3D objects).

Up to now, far fewer studies have addressed learning target selection than
feature selection. Since “What to learn?” is the most primary problem in
machine learning, we do need a systematic, or comparative, study on this
subject. The investigations in [11,12] into discriminative and generative
models confirm the importance of learning target selection in the vein of
computational representation. From these investigations, one can identify
the advantages and disadvantages of each model for applications. A better
machine that gains the benefits of both models has been developed [13].
Furthermore, the subject of “What to learn?” will provide a strong driving
force for machine learning study in seeking “the fundamental laws that
govern all learning processes” [14].

Take a decision rule, “Less costs more”¹, for example. Generally, Chinese
people judge an object’s value according to this rule. In Big Data
processing, the useful information, which often belongs to a minority class,
is extracted from massive datasets. While an English idiom describes this as
“finding a needle in a haystack”, the Chinese saying refers to “searching
for a needle in the sea (大海捞针)”. Users may consider that an error from a
minority class costs more than one from a majority class in their searching
practices. This consideration leads to a decision rule like “Less costs
more”. The rule will be one of the important strategies in Big Data
processing. Two questions arise from the example. What is the mathematical
principle (or fundamental law) supporting the decision rule “Less costs
more”? Is it a Bayesian rule? Machine learning study needs to answer these
questions.


1 This rule is translated from the Chinese saying “物以稀为贵”, in Pinyin “Wu
Yi Xi Wei Gui”. The translation is modified from the English phrase “Less is
more”, which usually describes simplicity in design.







3 Information Theoretical Learning 

Shannon introduced the concept of “entropy” as the basis of information
theory [15]:

$$H(Y) = -\sum_{y} p(y) \log_2 p(y), \tag{1}$$

where Y is a discrete random variable with probability mass function p(y).
Entropy expresses the disorder of the information. From this basic concept,
the other information measures (or entropy functions) can be formed
(Table 1), where p(t, y) is the joint distribution of the target random
variable T and the prediction random variable Y, and p(t) and p(y) are
called marginal distributions. We call them measures because some of them,
such as the KL divergence (which is asymmetric), do not fully satisfy the
metric properties. Other measures from information theory can be listed as
learning criteria, but those in Table 1 are the most common and are
sufficiently meaningful for the present discussion.
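
As a small numerical illustration of Eq. (1), the following sketch computes the entropy of a few probability mass functions. The function name `entropy` and the example distributions are assumptions made here for illustration only; zero-probability outcomes are skipped, following the usual convention 0 log 0 = 0.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a pmf given as a list of probabilities."""
    # Outcomes with zero probability contribute nothing (0 * log 0 := 0).
    return -sum(q * math.log2(q) for q in p if q > 0)

h_uniform = entropy([0.5, 0.5])   # two equally likely outcomes
h_skewed = entropy([0.9, 0.1])    # one outcome dominates
h_certain = entropy([1.0, 0.0])   # no uncertainty at all

print(h_uniform, h_skewed, h_certain)
```

The uniform case gives the largest value and the deterministic case gives zero, which matches the reading of entropy as a degree of disorder.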


Table 1. Some information formulas and their properties as learning measures.

Name                 Formula                                                          (Dis)similarity  (A)symmetry
Joint Information    $H(T,Y) = -\sum_t \sum_y p(t,y) \log_2 p(t,y)$                    Inapplicable     Symmetry
Mutual Information   $I(T,Y) = \sum_t \sum_y p(t,y) \log_2 \frac{p(t,y)}{p(t)p(y)}$    Similarity       Symmetry
Conditional Entropy  $H(Y|T) = -\sum_t \sum_y p(t,y) \log_2 p(y|t)$                    Dissimilarity    Asymmetry
Cross Entropy        $H(T;Y) = -\sum_z p_t(z) \log_2 p_y(z)$                           Dissimilarity    Asymmetry
KL Divergence        $KL(T,Y) = \sum_z p_t(z) \log_2 \frac{p_t(z)}{p_y(z)}$            Dissimilarity    Asymmetry
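
The formulas in Table 1 can be evaluated directly from a joint distribution. The sketch below is a minimal illustration, assuming that the joint pmf is supplied as a matrix with rows indexed by T and columns by Y, that T and Y share the same label set, and that every marginal probability is strictly positive. The function name `table1_measures`, the dictionary keys, and the example matrix are hypothetical.

```python
import numpy as np

def table1_measures(joint):
    """Evaluate the Table 1 quantities (in bits) for a joint pmf p(t, y).

    Assumes all marginal probabilities are strictly positive.
    """
    p = np.asarray(joint, dtype=float)
    p_t = p.sum(axis=1)            # marginal of the target T
    p_y = p.sum(axis=0)            # marginal of the prediction Y
    nz = p > 0                     # skip zero cells (0 * log 0 := 0)
    outer = np.outer(p_t, p_y)     # p(t) p(y)
    cond = p / p_t[:, None]        # p(y | t) = p(t, y) / p(t)
    return {
        "joint": float(-np.sum(p[nz] * np.log2(p[nz]))),
        "mutual": float(np.sum(p[nz] * np.log2(p[nz] / outer[nz]))),
        "conditional": float(-np.sum(p[nz] * np.log2(cond[nz]))),
        "cross": float(-np.sum(p_t * np.log2(p_y))),
        "kl": float(np.sum(p_t * np.log2(p_t / p_y))),
    }

# Hypothetical joint distribution of a binary classifier.
measures = table1_measures([[0.4, 0.1], [0.1, 0.4]])
for name, value in measures.items():
    print(f"{name}: {value:.6f}")
```

In this example the two marginals coincide, so the KL divergence is zero and the cross entropy equals H(T), although the classifier is far from exact.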


We can divide the learning machines, in view of “mathematical principles”,
into two groups. One group is designed based on empirical formulas, such as
error rate or bound, cost (or risk), utility, or classification margins. The
other is based on information theory [1,16]. Therefore, a systematic study
seems necessary to answer the two basic questions below [17].

Q1: When one of the principal tasks in machine learning is to process data, can
we apply entropy or information measures as a generic learning target for 
dealing with uncertainty of data in machine learning? 

Q2: What are the relations between information learning criteria and empirical 
learning criteria, and the advantages and limitations in using information 
learning criteria? 

Regarding the first question, Watanabe [18,19] proposed that “learning is
an entropy-decreasing process” and that pattern recognition is “a quest for
minimum entropy”. The principle behind entropy criteria is to transform
disordered data into ordered data (or patterns). Watanabe seems to be the
first “to cast the problems of learning in terms of minimizing properly
defined entropy functions” [20], and his work throws a brilliant light on
learning target selection in machine learning.

In 1988, Zellner theoretically proved that Bayes’ theorem can be derived
from the optimal information processing rule [21]. This study presents a
novel, yet important, finding that Bayesian theory is rooted in information
and optimization concepts. Another significant contribution is given by
Principe and his collaborators [22,1], who proposed Information Theoretical
Learning (ITL) as a generic learning target in machine learning. We consider
that ITL will stimulate us to develop new learning machines as well as
“theoretical interpretations” of learning mechanisms. Take again the example
of the decision rule “Less costs more”. Hu [23] demonstrates theoretically
that the Bayesian principle is unable to support the rule. When the
population of a minority class approaches zero, Bayesian classifiers tend to
misclassify the minority class completely. The numerical studies [23,24]
show that mutual information provides positive examples of the rule.
Classifiers based on mutual information are able to protect a minority class
and automatically balance the error types and reject types in terms of the
population ratios of the classes. These studies reveal a possible
mathematical interpretation of the learning mechanism behind the rule.
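
The contrast between an accuracy-style criterion and mutual information on a skewed dataset can be sketched with two hypothetical confusion matrices. This is only an illustration and not the procedure of [23,24]: the counts are invented, rows index the true class (majority first), and columns index the predicted class. Classifier A assigns every sample to the majority class, while classifier B recovers most minority samples at the price of some majority errors.

```python
import numpy as np

def mutual_information(counts):
    """I(T,Y) in bits, estimated from a confusion matrix of counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    outer = np.outer(p.sum(axis=1), p.sum(axis=0))
    nz = p > 0                      # skip empty cells (0 * log 0 := 0)
    return float(np.sum(p[nz] * np.log2(p[nz] / outer[nz])))

def accuracy(counts):
    """Fraction of samples on the diagonal of the confusion matrix."""
    c = np.asarray(counts, dtype=float)
    return float(np.trace(c) / c.sum())

# Hypothetical counts: 990 majority samples and 10 minority samples.
always_majority = [[990, 0], [10, 0]]    # classifier A
protects_minority = [[970, 20], [2, 8]]  # classifier B

acc_a, mi_a = accuracy(always_majority), mutual_information(always_majority)
acc_b, mi_b = accuracy(protects_minority), mutual_information(protects_minority)

print(f"A: accuracy={acc_a:.3f}, mutual information={mi_a:.4f} bits")
print(f"B: accuracy={acc_b:.3f}, mutual information={mi_b:.4f} bits")
```

Under these assumed counts, accuracy prefers classifier A, whose predictions carry no information about the true class, whereas mutual information prefers classifier B, which is the behavior described by “Less costs more”.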

4 (Dis)similarity Measures in Machine Learning 

When mutual information describes similarity between two variables, the
other information measures in Table 1 are applied in the sense of
dissimilarity. For a better understanding, their graphic relations are shown
in Fig. 5. If we consider that the variable T provides a statistical ground
truth (that is, p(t) = (p_1, ..., p_m), where the population rates
p_i (i = 1, ..., m) are known and fixed), its entropy H(T) will be the
baseline in learning. In other words, when the following relations hold,

$$I(T,Y) = H(T;Y) = H(Y;T) = H(Y) = H(T), \quad \text{or} \quad KL(T,Y) = KL(Y,T) = H(T|Y) = H(Y|T) = 0, \tag{2}$$

we say that the measures reach the baseline of H(T).
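
The relations in Eq. (2) can be checked numerically for the case in which the prediction equals the target in every sample. The sketch below assumes hypothetical class proportions (0.7, 0.2, 0.1) and uses the standard identities I(T,Y) = H(T) + H(Y) - H(T,Y), H(T|Y) = H(T,Y) - H(Y), and H(Y|T) = H(T,Y) - H(T).

```python
import numpy as np

p_t = np.array([0.7, 0.2, 0.1])   # assumed, fixed class proportions
joint = np.diag(p_t)              # p(t, y) when y = t in every sample
p_y = joint.sum(axis=0)           # marginal of Y, equal to p_t here

h_t = -np.sum(p_t * np.log2(p_t))
h_y = -np.sum(p_y * np.log2(p_y))
nz = joint > 0                    # skip zero cells (0 * log 0 := 0)
h_joint = -np.sum(joint[nz] * np.log2(joint[nz]))

mi = h_t + h_y - h_joint                   # I(T,Y)
cross_ty = -np.sum(p_t * np.log2(p_y))     # H(T;Y)
cross_yt = -np.sum(p_y * np.log2(p_t))     # H(Y;T)
kl_ty = np.sum(p_t * np.log2(p_t / p_y))   # KL(T,Y)
kl_yt = np.sum(p_y * np.log2(p_y / p_t))   # KL(Y,T)
h_t_given_y = h_joint - h_y                # H(T|Y)
h_y_given_t = h_joint - h_t                # H(Y|T)

print("H(T) =", h_t)
print("I, H(T;Y), H(Y;T), H(Y):", mi, cross_ty, cross_yt, h_y)
print("KL(T,Y), KL(Y,T), H(T|Y), H(Y|T):", kl_ty, kl_yt, h_t_given_y, h_y_given_t)
```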

Based on the study in [26], further relations between exact classifications
and the information measures are illustrated in Fig. 6. We use the notations
E, Rej, A, and CR for the error, reject, accuracy, and correct recognition
rates, respectively. Their relations are given by:

$$CR + E + Rej = 1, \qquad A = \frac{CR}{CR + E}. \tag{3}$$
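
A short sketch of Eq. (3) follows. The label lists are hypothetical, and the use of `None` to mark a rejected sample is an assumption of this example rather than a convention taken from [26].

```python
def rates(targets, predictions):
    """Return (CR, E, Rej, A) for predictions in which None means reject."""
    n = len(targets)
    n_reject = sum(1 for y in predictions if y is None)
    n_correct = sum(1 for t, y in zip(targets, predictions)
                    if y is not None and y == t)
    n_error = n - n_reject - n_correct
    cr, e, rej = n_correct / n, n_error / n, n_reject / n
    # Accuracy is measured only over the samples that were not rejected.
    a = n_correct / (n_correct + n_error)
    return cr, e, rej, a

# Hypothetical labels: ten samples, two of which are rejected.
targets = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predictions = [0, 0, 0, 1, None, 1, 1, 1, 0, None]

cr, e, rej, a = rates(targets, predictions)
print(f"CR={cr}, E={e}, Rej={rej}, A={a}")
```

Here six samples are correct, two are wrong, and two are rejected, so CR = 0.6, E = 0.2, Rej = 0.2, and A = 0.75.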


The form {y_k} = {t_k} in Fig. 6 describes an equality between the label
variables in every sample. For a finite dataset, the empirical forms should be






[Figure 5: diagram with regions labeled H(T,Y), KL(T,Y), and KL(Y,T).]

Fig. 5. Graphic relations among joint information, mutual information,
marginal information, conditional entropy, cross entropy and KL divergences
(modified based on [25] by including cross entropy and KL divergences).


used for representing the distributions and measures [26]. Note that the link
“↔” indicates a two-way connection for equivalent relations, and “→” a
one-way connection. Three important aspects can be observed from Fig. 6:

I. A necessary condition for exact classification is that all the
information measures reach the baseline of H(T).

II. An information measure reaching the baseline of H(T) is not sufficient
to indicate an exact classification.

III. The locations of the one-way connections explain why and where a
sufficient condition exists.
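
Observation II can be illustrated with a deliberately extreme case. In the sketch below, which uses hypothetical binary labels, every prediction is the opposite of the true class. Mutual information still equals H(T) and the KL divergence between the marginals is still zero, yet no sample is classified correctly.

```python
import numpy as np

t = np.array([0, 0, 1, 1])   # hypothetical true labels
y = 1 - t                    # every prediction is the wrong class

# Empirical joint distribution p(t, y) from the label pairs.
joint = np.zeros((2, 2))
for ti, yi in zip(t, y):
    joint[ti, yi] += 1
joint /= joint.sum()

p_t, p_y = joint.sum(axis=1), joint.sum(axis=0)
h_t = -np.sum(p_t * np.log2(p_t))          # baseline H(T)
nz = joint > 0                             # skip zero cells
mi = np.sum(joint[nz] * np.log2(joint[nz] / np.outer(p_t, p_y)[nz]))
kl = np.sum(p_t * np.log2(p_t / p_y))      # KL(T,Y) between the marginals
error_rate = float(np.mean(t != y))

print("H(T) =", h_t, " I(T,Y) =", mi, " KL(T,Y) =", kl)
print("error rate =", error_rate)
```

The measures depend only on the probabilities and not on which label is attached to which class, so reaching the baseline cannot by itself certify that y_k = t_k for every sample.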





Fig. 6. Relations between exact classifications and mutual information, 
conditional entropy, cross entropy and KL divergences. 


Although Fig. 6 only shows the relations for the information measures listed
in Table 1 in classification problems, its observations may extend to other
information measures as well as to other problems, such as clustering,
feature selection/extraction, and image registration. When we consider
machine learning or pattern recognition to be a process of handling data in
a similarity sense (any dissimilarity measure can be transformed into a
similarity one [26]), one important theorem describes their relations.












































Theorem 1. Generally, there is no one-to-one correspondence between the 
empirically-defined similarity measures and information measures. 

The proof is omitted in this paper, but it can be given based on the study
of bounds between entropy and error (cf. [27] and references therein). The
significance of Theorem 1 is that optimizing an information measure is not
guaranteed to optimize the empirically-defined similarity measure.
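
A numerical illustration, not a proof, of the statement in Theorem 1 is sketched below. The three confusion matrices are hypothetical (rows are true classes, columns are predicted classes, 100 samples each, balanced classes). Matrices M1 and M2 have the same accuracy but different mutual information, and M3 has a higher mutual information than M1 together with a lower accuracy.

```python
import numpy as np

def mutual_information(counts):
    """I(T,Y) in bits, estimated from a confusion matrix of counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    outer = np.outer(p.sum(axis=1), p.sum(axis=0))
    nz = p > 0                      # skip empty cells (0 * log 0 := 0)
    return float(np.sum(p[nz] * np.log2(p[nz] / outer[nz])))

def accuracy(counts):
    """Fraction of samples on the diagonal of the confusion matrix."""
    c = np.asarray(counts, dtype=float)
    return float(np.trace(c) / c.sum())

# Hypothetical confusion matrices for three classifiers.
m1 = [[40, 10], [10, 40]]   # errors spread evenly over both classes
m2 = [[50, 0], [20, 30]]    # all errors fall in one class
m3 = [[50, 0], [25, 25]]    # more errors than m1, all in one class

for name, m in (("M1", m1), ("M2", m2), ("M3", m3)):
    print(f"{name}: accuracy={accuracy(m):.2f}, "
          f"mutual information={mutual_information(m):.4f} bits")
```

Under these assumed counts, a learner that maximizes mutual information would prefer M3 to M1, while a learner that maximizes accuracy would make the opposite choice.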

5 Final remarks 

Machine learning can be explored from different perspectives depending on
the study goals of researchers. For a deep mathematical understanding of our
own learning mechanisms, we can take learning machines as an extension of
human sensory perception. This paper emphasizes identifying the primary
problem in machine learning from a novel perspective. I define it as “What
to learn?” or “learning target selection”. Furthermore, two sets of
representations are specified, namely, “linguistic representation” and
“computational representation”. While a wide variety of computational
representations have been reported for learning targets, we can ask whether
there exists a unified, yet fundamental, principle behind them. Toward this
purpose, this paper extends Watanabe’s proposal [18,19] and the studies of
Zellner [21] and Principe [22] to a “conjecture of learning target
selection”, stated as follows.

Conjecture 1. In a machine learning study, all computational representations
of learning target(s) can be interpreted, or described, by optimization of
entropy function(s).

I expect that the proposal of the conjecture above will provide a new driving
force not only for seeking fundamental laws governing all learning processes
[14] but also for developing improved learning machines [28] in various
applications.